I heard from an acquaintance that this R course is useful for handling research data efficiently, so my interest was guaranteed, even though it takes some time to get familiar with this programming language.
I am looking forward to challenging myself with something completely new and to finding tools for analyzing my research data. More about RStudio can be found here
students2014 <- read.csv(file = "/Users/streetman/IODS-project/data/lrn14.csv", header = T, sep=",")
students2014 <- students2014[,-1]
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
The data is a subset of a larger survey on approaches to learning, which posed various questions to students and recorded some demographic variables. The subset consists of 166 rows (observations) and 7 columns (variables): gender, Age, Attitude, deep (deep-learning questions), stra (strategic-learning questions), surf (surface-learning questions), and Points.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
Based on the visualization we can see, for example, that nearly twice as many respondents were female (N = 110), that ages ranged from 17 to 55 with a mean of 25, and that among the correlations with Points the strongest is with Attitude and the weakest is with deep.
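The figures quoted above can also be checked numerically with base R; here is a minimal sketch on a small made-up data frame (in practice the real students2014 would be used):

```r
# Toy stand-in for students2014 (hypothetical values, for illustration only)
toy <- data.frame(
  gender   = factor(c("F", "F", "F", "M", "M")),
  Age      = c(21, 25, 30, 22, 27),
  Attitude = c(30, 35, 25, 40, 32),
  Points   = c(20, 24, 18, 28, 22)
)

table(toy$gender)               # counts per gender
range(toy$Age)                  # minimum and maximum age
mean(toy$Age)                   # mean age
cor(toy$Attitude, toy$Points)   # correlation between attitude and points
```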
## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra + surf, data = students2014)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra, data = students2014)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
## Fit a linear model
model_fit <- lm(Points ~ Attitude, data = students2014)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The target variable is Points, with Attitude, stra and surf as explanatory variables. Neither stra nor surf shows a statistically significant relationship with the target (p = 0.117 and p = 0.466), so surf, the weaker of the two, was removed first.
After that, stra still showed no statistically significant relationship with the target variable (p = 0.089), so it was removed as well.
In the final model, Attitude has a statistically significant relationship with Points (p < 0.001), with a multiple R-squared of 0.1906, which means that this variable explains approximately 19% of the variation in the points achieved by students.
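As a reminder of what multiple R-squared measures, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares; a sketch on simulated data (not the course data):

```r
set.seed(1)
# Simulated data, not the course data
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

# Multiple R-squared = 1 - residual SS / total SS
rss <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)
rsq <- 1 - rss / tss

all.equal(rsq, summary(fit)$r.squared)  # TRUE
```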
my_model2 <- lm(Points ~ Attitude, data = students2014)
plot(my_model2, which = c(1,2,5))
A residual is the difference between the observed and the predicted value at a given point. “Residuals vs. Fitted”: the assumption that the errors do not depend on the explanatory variable holds, since the points are evenly spread without any clear pattern. “Normal Q-Q”: the assumption that the residuals are normally distributed is supported by the plot. “Residuals vs. Leverage”: no single observation has an unusually high impact on the model.
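For clarity, a residual is simply the observed value minus the fitted value, which can be verified directly (again on simulated data, not students2014):

```r
set.seed(42)
# Simulated data, not students2014
x <- runif(50)
y <- 3 + 2 * x + rnorm(50, sd = 0.5)
fit <- lm(y ~ x)

# Residual = observed value minus fitted (predicted) value
manual_resid <- y - fitted(fit)
all.equal(unname(resid(fit)), unname(manual_resid))  # TRUE
```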
getwd()
## [1] "/Users/streetman/IODS-project"
alc <- read.csv(file = "/Users/streetman/IODS-project/data/alc.csv", header = T, sep=",")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
The dataset includes the variables listed above, covering the students' background and questions related to their alcohol use.
library(tidyr)
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob <fct> teacher, other, other, services, other, other, other,…
## $ reason <fct> course, course, other, home, home, reputation, home, …
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ guardian <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 1…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, …
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
These four variables were chosen: 1. age (numeric) 2. famsize (binary: less than or equal to 3, or greater than 3) 3. failures (numeric: number of past class failures, 0–3, or 4 if more) 4. absences (numeric: 0 to 93)
Hypothesis: high alcohol consumption is associated with each of the chosen variables when they are examined one at a time. H0: there is no association between a chosen variable and high alcohol consumption. H1: there is a significant association between a chosen variable and high alcohol consumption.
summary(alc)
## X school sex age address famsize
## Min. : 1.00 GP:342 F:198 Min. :15.00 R: 81 GT3:278
## 1st Qu.: 96.25 MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104
## Median :191.50 Median :17.00
## Mean :191.50 Mean :16.59
## 3rd Qu.:286.75 3rd Qu.:17.00
## Max. :382.00 Max. :22.00
## Pstatus Medu Fedu Mjob Fjob
## A: 38 Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## T:344 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
For example, for the variable “absences” the mean is 4.5 and the maximum is 45. The range is wide, but since the 3rd quartile is 6.0, most values lie between 0 and 6.
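That reading follows from the definition of quartiles: at least 75% of the observations fall at or below the third quartile. A quick sketch on hypothetical count data (not the real absences) confirms it:

```r
set.seed(7)
# Hypothetical right-skewed counts standing in for the absences variable
absences <- rpois(300, lambda = 4)

# The 3rd quartile
q3 <- quantile(absences, 0.75)

# By definition, at least 75% of the observations fall at or below it
mean(absences <= q3)
```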
library(tidyr); library(dplyr); library(ggplot2)
alc_selected <- select(alc, one_of(c("age", "famsize", "failures", "absences")))
gather(alc_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
For example, for the variable “age” we can see that most participants are under 19 years old.
g1 <- ggplot(alc, aes(x = high_use, y = age))
g1 + geom_boxplot(aes(col=sex)) + ylab("age") + ggtitle("Students alcohol consumption and age")
Those with high alcohol consumption were mainly male, and within this consumption category the age range of males is wider than that of females.
g1 <- ggplot(data = alc, aes(x = alc_use, fill = famsize))
g1 + geom_bar()
It seems that the higher the alcohol consumption, the more even the split between family sizes; at lower consumption levels, large families were considerably more common.
alc$failures <- as.factor(alc$failures)
g1 <- ggplot(alc, aes(x = high_use, fill = failures))
g1 + geom_bar()
Somewhat surprisingly, the number of failures is higher in the low alcohol consumption group.
g1 <- ggplot(alc, aes(x = high_use, y = absences))
g1 + geom_boxplot(aes(col=sex)) + ylab("absences") + ggtitle("Students alcohol consumption and absences")
It seems that the more absences a student has, the more likely high alcohol consumption is, for both sexes.
## Fit a linear model
alc$failures <- as.numeric(alc$failures) # converts the factor back to numeric (levels 0-3 become 1-4; this shifts only the intercept)
alc$famsize <- as.numeric(alc$famsize)
model_fit <- lm(high_use ~ age + famsize + failures + absences, data = alc)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = high_use ~ age + famsize + failures + absences,
## data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9850 -0.2883 -0.2132 0.5052 0.8587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.399649 0.327648 -1.220 0.22332
## age 0.023963 0.019765 1.212 0.22612
## famsize 0.075069 0.050854 1.476 0.14073
## failures 0.106463 0.039562 2.691 0.00744 **
## absences 0.017151 0.004184 4.099 5.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4419 on 377 degrees of freedom
## Multiple R-squared: 0.07954, Adjusted R-squared: 0.06977
## F-statistic: 8.144 on 4 and 377 DF, p-value: 2.625e-06
## Fit a linear model
alc$failures <- as.numeric(alc$failures)
model_fit <- lm(high_use ~ age + failures + absences, data = alc)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = high_use ~ age + failures + absences, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0151 -0.2787 -0.2162 0.5361 0.8393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.308391 0.322265 -0.957 0.33920
## age 0.024304 0.019795 1.228 0.22029
## failures 0.104492 0.039601 2.639 0.00867 **
## absences 0.017366 0.004188 4.146 4.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4426 on 378 degrees of freedom
## Multiple R-squared: 0.07422, Adjusted R-squared: 0.06687
## F-statistic: 10.1 on 3 and 378 DF, p-value: 2.045e-06
## Fit a linear model
alc$failures <- as.numeric(alc$failures)
alc$famsize <- as.numeric(alc$famsize)
model_fit <- lm(high_use ~ failures + absences, data = alc)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = high_use ~ failures + absences, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0035 -0.2801 -0.2127 0.5510 0.8053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.081634 0.054308 1.503 0.13363
## failures 0.113115 0.038999 2.900 0.00394 **
## absences 0.017973 0.004162 4.319 2.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4429 on 379 degrees of freedom
## Multiple R-squared: 0.07052, Adjusted R-squared: 0.06562
## F-statistic: 14.38 on 2 and 379 DF, p-value: 9.573e-07
The target variable is high_use (of alcohol), with age, family size, failures and absences as explanatory variables. Neither age nor family size shows a statistically significant relationship with the target (p = 0.226 and p = 0.141), so family size was removed first.
After that, age still showed no statistically significant relationship with the target variable (p = 0.220), so it was removed as well.
In the final model, failures and absences both have statistically significant relationships with high_use (p = 0.004 and p < 0.001), with a multiple R-squared of 0.07, which means that these variables explain approximately 7% of the variation in high_use among the participants.
### Fitting the logistic regression model
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
### Compute odds ratios (OR)
OR <- coef(m) %>% exp
### Compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
### Printing the output
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1528951 0.08788853 0.2604241
## failures 1.6482547 1.14889905 2.3830880
## absences 1.0898257 1.04433337 1.1423489
The odds ratios of both variables, failures and absences, are above 1, indicating higher odds of high_use (of alcohol); neither confidence interval crosses 1, so both associations are statistically significant at the 5% level. Of the chosen variables, absences has the smaller p-value and the narrower interval, so it appears to have the most reliable relationship to high_use (of alcohol).
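As background, an odds ratio is the exponentiated logistic regression coefficient: it multiplies the odds of the outcome for each one-unit increase in the predictor. A minimal sketch on simulated data (not the alc data):

```r
set.seed(123)
# Simulated data: binary outcome whose log-odds increase with x
x <- rnorm(500)
p <- plogis(-1 + 0.8 * x)           # true model on the log-odds scale
yb <- rbinom(500, size = 1, prob = p)

m <- glm(yb ~ x, family = "binomial")
OR <- exp(coef(m))                  # odds ratio per one-unit increase in x
OR
```

Because the simulated effect of x is positive, its odds ratio comes out above 1, mirroring the pattern seen for failures and absences above.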
## Fit the model
m <- glm(high_use ~ absences, data = alc, family = "binomial")
## Predict() the probability of high_use
probabilities <- predict(m, type = "response")
## Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
## Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
## Tabulate the target variable versus the predictions
results <- table(high_use = alc$high_use, prediction = alc$prediction)
results
## prediction
## high_use FALSE TRUE
## FALSE 263 5
## TRUE 105 9
tab <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
tab
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68848168 0.01308901 0.70157068
## TRUE 0.27486911 0.02356021 0.29842932
## Sum 0.96335079 0.03664921 1.00000000
## Total proportion of inaccurately classified individuals
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2879581
Based on the logistic regression analysis, the variable “absences” has the strongest statistical relationship to high_use (of alcohol), so it was chosen here to test the predictive power of the model.
The computed loss of 0.288 indicates that approximately 29% of the predictions made with this model are incorrect, while 71% are correct, so the model has some predictive value.
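The same loss value can be recovered directly from the cross-tabulation above, since the off-diagonal cells are the misclassified students:

```r
# Confusion table from above: rows = actual high_use, cols = prediction
conf <- matrix(c(263, 105,   # predicted FALSE
                   5,   9),  # predicted TRUE
               nrow = 2,
               dimnames = list(high_use  = c("FALSE", "TRUE"),
                               predicted = c("FALSE", "TRUE")))

misclassified <- conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]
error_rate <- misclassified / sum(conf)
error_rate  # about 0.288, matching the loss function above
```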
install.packages("MASS")
install.packages("corrplot")
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The dataset has 506 rows (observations) and 14 columns (variables). The data concerns housing values in the suburbs of Boston. More about the data variables can be found here
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix <- cor(Boston)
cor_matrix %>% round(2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
"VISUALIZING the correlations"
## [1] "VISUALIZING the correlations"
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The crime rate seems to have its strongest positive correlations with accessibility to radial highways (rad, 0.63) and the property tax rate (tax, 0.58).
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
We scaled the data to standardize the variables so that they can be compared with each other directly.
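scale() subtracts each column's mean and divides by its standard deviation, so every variable ends up with mean 0 and standard deviation 1; a quick sketch on a toy vector:

```r
# Standardization by hand vs. scale() on a toy vector
v <- c(2, 4, 6, 8, 10)
manual <- (v - mean(v)) / sd(v)

all.equal(manual, as.vector(scale(v)))  # TRUE
mean(manual)  # 0
sd(manual)    # 1
```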
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low","med_low","med_high","high"), include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
nrow(boston_scaled)
## [1] 506
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
summary(test)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.4470 Min. :-0.27233 Min. :-1.4040
## 1st Qu.:-0.48724 1st Qu.:-0.9554 1st Qu.:-0.27233 1st Qu.:-0.9272
## Median :-0.48724 Median :-0.4368 Median :-0.27233 Median :-0.3426
## Mean : 0.08004 Mean :-0.1448 Mean :-0.04074 Mean :-0.1354
## 3rd Qu.: 0.37030 3rd Qu.: 1.0150 3rd Qu.:-0.27233 3rd Qu.: 0.4169
## Max. : 3.37170 Max. : 2.1155 Max. : 3.66477 Max. : 2.7296
## rm age dis
## Min. :-3.87641 Min. :-2.0809 Min. :-1.19192
## 1st Qu.:-0.56878 1st Qu.:-0.9459 1st Qu.:-0.74287
## Median :-0.17098 Median : 0.2886 Median :-0.19537
## Mean :-0.06577 Mean :-0.0556 Mean : 0.08213
## 3rd Qu.: 0.37768 3rd Qu.: 0.8926 3rd Qu.: 0.79494
## Max. : 2.81998 Max. : 1.1164 Max. : 3.28405
## rad tax ptratio
## Min. :-0.98187 Min. :-1.30676 Min. :-2.51994
## 1st Qu.:-0.63733 1st Qu.:-0.78313 1st Qu.:-0.48756
## Median :-0.52248 Median :-0.39895 Median : 0.08983
## Mean :-0.08562 Mean :-0.08546 Mean : 0.04137
## 3rd Qu.:-0.17794 3rd Qu.: 0.17066 3rd Qu.: 0.80578
## Max. : 1.65960 Max. : 1.52941 Max. : 1.63721
## black lstat medv
## Min. :-3.903331 Min. :-1.50301 Min. :-1.68888
## 1st Qu.: 0.207114 1st Qu.:-0.79478 1st Qu.:-0.58799
## Median : 0.393297 Median :-0.33161 Median :-0.22646
## Mean : 0.007433 Mean :-0.02143 Mean :-0.01273
## 3rd Qu.: 0.440616 3rd Qu.: 0.42213 3rd Qu.: 0.28457
## Max. : 0.440616 Max. : 3.40663 Max. : 2.98650
We created a categorical crime rate variable by cutting the scaled crime rate at its quartiles, as shown above. We also split the data into “train” and “test” sets for fitting and evaluating the LDA (linear discriminant analysis) model.
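The two steps, binning a continuous variable at its quantiles and drawing a random 80% training sample, can be sketched on toy data (the seed is hypothetical, only for reproducibility):

```r
set.seed(2024)  # hypothetical seed, for reproducibility
x <- rnorm(200)

# Cut into four classes at the quartiles (as done for the crime rate)
bins <- quantile(x)
cls <- cut(x, breaks = bins,
           labels = c("low", "med_low", "med_high", "high"),
           include.lowest = TRUE)
table(cls)  # four equal groups of 50

# Random 80/20 train/test split (as done for boston_scaled)
n <- length(x)
ind <- sample(n, size = n * 0.8)
length(ind)      # 160 rows for training
length(x[-ind])  # 40 rows left for testing
```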
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2500000 0.2450495 0.2648515
##
## Group means:
## zn indus chas nox rm
## low 0.93521987 -0.8724530 -0.10997442 -0.8493333 0.5037053
## med_low -0.09731296 -0.2386718 -0.03844192 -0.5495981 -0.1647176
## med_high -0.37290116 0.1482665 0.20489520 0.4004736 0.1364623
## high -0.48724019 1.0170108 -0.01476176 1.0473201 -0.3647096
## age dis rad tax ptratio
## low -0.8456021 0.8258700 -0.6965304 -0.7626577 -0.45088994
## med_low -0.3702610 0.3890044 -0.5497747 -0.4593960 -0.05857918
## med_high 0.4097210 -0.3830596 -0.4401196 -0.3332586 -0.38631073
## high 0.7899875 -0.8397445 1.6392096 1.5148289 0.78203563
## black lstat medv
## low 0.38099714 -0.78077152 0.56362182
## med_low 0.34657072 -0.08564246 -0.03564772
## med_high 0.06246724 -0.03002024 0.19741850
## high -0.73740944 0.83685068 -0.64781696
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07058871 0.72066372 -0.98831510
## indus 0.04180128 -0.07211078 0.39155827
## chas -0.13052780 -0.05246700 0.05704959
## nox 0.34196225 -0.79819437 -1.09728588
## rm -0.09005854 -0.08489643 -0.16074014
## age 0.21827904 -0.30988454 -0.24464117
## dis -0.03469009 -0.17988822 0.40345681
## rad 3.72467423 0.91409926 -0.21900898
## tax 0.06866486 -0.06288839 0.65027508
## ptratio 0.11073825 0.07609792 -0.26913691
## black -0.09532881 0.09467939 0.20523587
## lstat 0.22226254 -0.19752611 0.49478322
## medv 0.19276678 -0.36854075 -0.08056736
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9611 0.0273 0.0116
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Here we fitted the LDA model on the “train” dataset; the visualization shows that the crime rate groups are fairly distinct, with some overlap between them.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 9 1 0
## med_low 5 13 7 0
## med_high 1 9 14 3
## high 0 0 0 20
Here we applied the LDA model to the “test” dataset. The cross-tabulation indicates that high crime rate values are predicted better than low ones. As a result, this is not an optimal prediction model, but a fairly good one.
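The overall test-set accuracy can be read off the diagonal of the cross-tabulation above (note that the exact counts depend on the random train/test split):

```r
# Confusion matrix from above: rows = correct class, cols = predicted class
lv <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(20,  9,  1,  0,
                  5, 13,  7,  0,
                  1,  9, 14,  3,
                  0,  0,  0, 20),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = lv, predicted = lv))

accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # about 0.66: two thirds of test observations classified correctly
```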
data(Boston)
boston_scaled_2 <- scale(Boston)
str(boston_scaled_2)
## num [1:506, 1:14] -0.419 -0.417 -0.417 -0.416 -0.412 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:506] "1" "2" "3" "4" ...
## ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
## - attr(*, "scaled:center")= Named num [1:14] 3.6135 11.3636 11.1368 0.0692 0.5547 ...
## ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
## - attr(*, "scaled:scale")= Named num [1:14] 8.602 23.322 6.86 0.254 0.116 ...
## ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
dim(boston_scaled_2)
## [1] 506 14
dist_eu <- dist(boston_scaled_2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man <- dist(boston_scaled_2, method = "manhattan")
summary(dist_eu) # note: this repeats the Euclidean summary; summary(dist_man) would show the Manhattan distances
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
km <- kmeans(boston_scaled_2, centers = 3)
pairs(boston_scaled_2, col = km$cluster)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled_2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <- kmeans(boston_scaled_2, centers = 2)
pairs(boston_scaled_2, col = km$cluster)
Here we reloaded and rescaled the data and clustered it with k-means. Based on the plot above, the data is most appropriately clustered with two clusters, the point at which the curve of the total within-cluster sum of squares changes most sharply.
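The elbow criterion used here, choosing the k after which the total within-cluster sum of squares (WCSS) stops dropping sharply, can be illustrated on simulated data with two well-separated clusters:

```r
set.seed(1)
# Two well-separated simulated clusters in two dimensions
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

k_max <- 6
twcss <- sapply(1:k_max, function(k) {
  kmeans(pts, centers = k, nstart = 10)$tot.withinss
})

# The biggest drop in WCSS occurs when moving from 1 to 2 clusters,
# matching the true number of clusters in the simulation
round(twcss)
```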
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(GGally)
library(corrplot)
library(tidyr)
library(ggplot2)
getwd()
## [1] "/Users/streetman/IODS-project"
human <- read.csv(file = "/Users/streetman/IODS-project/data/human.csv", header = T, sep=",")
dim(human)
## [1] 155 9
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ X : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 27 102 ...
## $ Edu2.FM : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
From a numerical summary we can see, for example, that the expected years of education is 4.10 years at minimum, with a median of 8.4 years.
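Figures like these come straight from `summary()`, which reports the minimum, quartiles, median, mean, and maximum of a numeric column. A toy sketch (made-up values, not the actual Edu.Exp column):

```r
# summary() gives the five-number summary plus the mean
s <- summary(c(4.1, 6.0, 8.4, 10.2, 15.0))
s[["Min."]]    # 4.1
s[["Median"]]  # 8.4
```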
# note: GNI was read in as a factor, so as.integer() returns the factor
# level codes rather than the underlying GNI figures; a safer conversion
# would be as.numeric(gsub(",", "", as.character(human$GNI)))
human$GNI <- as.integer(human$GNI)
human_ <- human[,-1]
dim(human_)
## [1] 155 8
str(human_)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
ggpairs(human_)
cor(human_) %>% corrplot(method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
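Rather than eyeballing the correlation plot for the strongest relationship, the pair with the largest absolute correlation can be extracted programmatically. A self-contained sketch of the mechanics on toy data (not the human data):

```r
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)  # strongly related to a
toy$c <- rnorm(50)                    # unrelated noise

cm <- cor(toy)
diag(cm) <- 0  # ignore the trivial self-correlations on the diagonal
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
c(rownames(cm)[idx[1]], colnames(cm)[idx[2]])  # the most correlated pair: a and b
```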
# PCA on the raw (unstandardized) data; variables on large scales dominate the components
pca_human <- prcomp(human_)
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))
human_std <- scale(human_)
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-1.76122 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.91250 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.03967 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.96275 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 1.44288 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-1.7154 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.8577 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median : 0.0000 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8577 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 1.7154 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
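The zero Mean rows above are no accident: `scale()` subtracts each column's mean and divides by its standard deviation, so every standardized column has mean 0 and standard deviation 1. A quick self-contained check:

```r
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
m_std <- scale(m)  # (x - mean) / sd, column by column

colMeans(m_std)      # both ~ 0
apply(m_std, 2, sd)  # both 1
```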
# repeat the PCA on the standardized data, where all variables contribute on an equal footing
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))
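The biplot itself does not show how much variance PC1 and PC2 capture; `summary()` on a `prcomp` object reports the proportion of variance per component, which is also handy for labelling the biplot axes. A self-contained sketch on toy standardized data:

```r
set.seed(42)
toy <- data.frame(x = rnorm(100))
toy$y <- toy$x + rnorm(100, sd = 0.3)  # correlated with x, so PC1 captures both
toy$z <- rnorm(100)

pca <- prcomp(scale(toy))
ve <- summary(pca)$importance["Proportion of Variance", ]
round(ve * 100, 1)  # percentage of variance explained by each PC
```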
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
library(dplyr)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")
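The eta2 table in the summary (the squared correlation ratio of each variable with each dimension) can also be pulled straight from the fitted object, e.g. to find which variable dominates each dimension. A self-contained sketch assuming only FactoMineR and its bundled tea data:

```r
library(FactoMineR)
data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# eta2: rows are variables, columns are MCA dimensions;
# the largest value in a column names the variable that dominates that dimension
eta2 <- mca$var$eta2
rownames(eta2)[apply(eta2[, 1:3], 2, which.max)]
```

Given the eta2 table above, this should pick out how, where, and Tea for the first three dimensions.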